Get familiar with an IDE (Integrated Development Environment).
Know how to set up a data research project.
Visualise data: choose an appropriate visualisation for your data; understand ggplot2’s layered approach.
Be able to read in, process, and store data.
A few advanced topics: functions, conditional execution, joins, loops.
In this doc, all code chunks are folded. Readers are encouraged to look at the code for Sections 5 and 6 right away. From Section 7 onwards, the reader can start to jot down some initial code before seeing the solution.
Let’s manage expectations:
I tailored this workshop for a 2-hour session. Depending on the overall level, we might not cover all the content - that’s ok, you still get this document if you want to explore further.
I won’t be able to enter into any detail about R itself, statistics, or visualisation - and that’s also ok.
The main objective here really is to give you a flavour of a few of the things that can be currently achieved with the kinds of software and hardware we have at our disposal.
I assume that you are now staring at an IDE - be it RStudio or VS Code - as per the email that was shared with you. Please shout if that is not the case!
2 Intro
For this workshop, we borrow heavily from Data Visualization for Social Science. This very markdown is based on the R package accompanying the book. You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done.
At the very top of the file is a section of metadata, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You can/should change the title, author, and date to the values that suit you. Keep the output line as it is for now, however. Each line in the metadata has a structure: first the key (title, author, etc.), then a colon, and then the value associated with the key. It is very picky when it comes to indentation, characters used, etc.
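For instance, a minimal metadata block looks something like this (the values are illustrative and yours to change; keep the output line from your own file as it is):

```yaml
---
title: "Data Workshop Notes"
author: "Your Name"
date: "2025-01-01"
---
```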
2.1 This is a Quarto File
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents, and is also the basis for Quarto.
When you click the Render button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. A code chunk is a specially delimited section of the file. You can add one by moving the cursor to a blank line and choosing Code > Insert Chunk from the RStudio menu. When you do, an empty chunk will appear:
Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks also have a pair of braces and the letter r, to indicate what language the chunk is written in (this is a clue to the fact that you can have multiple programming languages in the document, such as Python, or Bash, or SQL). You write your code inside the code chunks. Write your notes and other material around them, as here.
3 Know your editor
Please have a look at your screen. Some of the most important things to learn now:
Console
Editor
Environment
For this workshop, we rely on RStudio. An alternative you may explore is VS Code.
4 Before you begin to explore and analyse, set up your workspace
A number of functions and packages come already installed in what is generally referred to as base R (you can actually see what’s in it by typing base:: in your console and letting the auto-complete suggest the full list of functions). However, most of the things we’ll do in this workshop need further libraries. To install them, make sure you have an Internet connection, then manually run the code in the chunk below. If you just render the document, this chunk will be skipped - we set it up that way because you only need to install these packages once, not every time you run this file. Either run the chunk using the little green “play” arrow to the right of the chunk area, or copy and paste the text into the console window.
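As a sketch, the installation chunk amounts to something like the following (the exact package list may differ in your copy; tidyverse, socviz, and gapminder are the ones loaded below):

```r
# run this once, manually - it is not needed at every render
install.packages(c("tidyverse", "socviz", "gapminder"))
```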
4.1 Load Libraries
Once you have installed your libraries, each time you run this document you must load them. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot2 (for data visualisation) and other libraries such as dplyr (for data manipulation). We also load the socviz and gapminder libraries.
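The loading chunk amounts to something like this (a sketch; the chunk option discussed below hides its output from the rendered document):

```r
# load at every run, unlike install.packages()
library(tidyverse) # ggplot2, dplyr, and friends
library(socviz)    # helper data and functions accompanying the book
library(gapminder) # the gapminder dataset
```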
Notice that here an option is set: include=FALSE. This tells R to run this code but not to include the output in the final document.
When you click the Render button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
Show the code
gapminder::gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
A final note about the here library. It is a great help if you have properly set up your project as just indicated. All paths become relative, and you can forget about remembering where you are on your machine. The essential thing is that it relies on the .Rproj file that is created automatically for you when you create a Project with RStudio.
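A minimal sketch (the data/my_file.csv path is hypothetical, just to show the idea):

```r
library(here)

# here() builds paths from the project root, i.e. wherever the .Rproj file lives,
# so the same relative path works on any machine
here("data", "my_file.csv")
```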
5 Data Visualisation
5.1 A few initial words
R has a consolidated role in data visualisation, well beyond its initial remit in statistics.
Let’s go back to the gapminder data we saw before. What kind of variables are we dealing with? Let’s remind ourselves.
Show the code
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Technically, this data is in a long format, and you can immediately spot it by looking at the repeated values in the first two columns. An example of data in wide format is provided by Eurostat.
We can explore the data in a more structured way, by checking what classes the columns have, and by getting some descriptive statistics.
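The folded chunk behind the output below amounts to calls along these lines (str() is base R, glimpse() is the dplyr equivalent, and summary() produces the descriptive statistics printed below):

```r
str(gapminder)            # compact display of the object's structure
dplyr::glimpse(gapminder) # tidyverse equivalent, one line per column
summary(gapminder)        # descriptive statistics, column by column
```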
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Notice how we use functions in R: we call a function - in the chunk above, str, glimpse, summary - and then include an argument between parentheses - in this case, the entire dataset.
Let’s now start to analyse the data. Say that we have a hypothesis regarding how life expectancy (lifeExp) changes in relation to GDP per capita (gdpPercap). More precisely, we may think that as GDP per capita increases, so should life expectancy. Graphically, if we look back at our data, we should see dots amassing in the lower-left and upper-right sections of our graph. As these look like continuous variables, we could plot them in a scatterplot. There are at least two ways to create a plot within ggplot. As you’ll see, there are usually multiple ways to achieve the exact same output with these programming languages.
We can assign a ggplot plot to an object called p (the name doesn’t matter, we can call it mike if we want to), and then add (literally, using +) layers on top of it.
Show the code
p <- ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
)
p + geom_point() # and you can keep on adding to this, see below ...
Another way of achieving the same thing is as follows, which is more in line with the tidyverse style, which relies heavily on pipes (|>).
Piping the data into ggplot() gives the exact same output as the p object above. Thus, we can add a layer on top of that ggplot object, namely geom_point().
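As a sketch, the pipe version reads like this:

```r
gapminder |>
  ggplot(mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```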
Let’s go back to the graph now. Sometimes, when we face a cloud like the one above, it is useful to plot a trendline on top, to understand the overall pattern.
Show the code
# geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")' - GAM stands for generalised additive model
p + geom_point() + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Perhaps that is too wiggly. Using method='lm' (linear model) as an argument to geom_smooth() includes a simple OLS regression. This also should make you aware of the danger of extrapolating from predictions of inappropriate models.
Show the code
# another way to get the same result
ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Data is quite bunched up against the left side. Gross Domestic Product per capita is not normally distributed across our country years. The x-axis scale would probably look better if it were transformed from a linear scale to a log scale. For this we can use a function called scale_x_log10(). As you might expect this function scales the x-axis of a plot to a log 10 basis. To use it we just add it to the plot:
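The folded chunk amounts to something like this, reusing the p object defined earlier:

```r
# same scatterplot and smoother as before, now on a log-10 x-axis
p + geom_point() +
  geom_smooth() +
  scale_x_log10()
```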
`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
The labels on the tick-marks can be controlled through the scale_* functions. Please notice that here we are using another library (scales) to have our tick labels in dollar format.
Show the code
p + geom_point() +
  geom_smooth(method = "gam") +
  scale_x_log10(labels = scales::dollar)
`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
The viz above already is way more informative than the ones before, but there is more we can extract from the underlying data. For instance, now it may make sense to re-introduce the lm method we used before.
In passing, also notice the use of different parameters in the viz. The se option controls the standard-error ribbon, and we can switch it off by setting se = FALSE. The alpha parameter regulates the transparency of the objects, from 0 to 1. Further, we can add info and make it more publication-ready.
Show the code
ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point(alpha = 0.3) +
  geom_smooth(color = "orange", se = FALSE, linewidth = 1, method = "lm") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "GDP per capita (in log scale)",
    y = "Life Expectancy in Years",
    title = "Economic Growth and Life Expectancy",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder."
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Another way to think about the underlying data is to remind ourselves about the groups we have, represented by the continent variable. In the code below, we include colour as an aesthetic, and map it on continents. Notice that, by also including a colour parameter in the geom_smooth function, we are overriding the colour mapping in the ggplot function. This has important consequences, and is a further example of the sequential execution typical of many programming languages (left-right, top-bottom).
Indeed, see what happens when we do not override it. We can immediately appreciate that the slope of our regression is radically different depending on the groups - consider this your introduction to the dangers of pooling.
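A sketch of the non-overridden version: colour is mapped to continent in the ggplot() call and inherited by every layer, so geom_smooth fits one regression line per continent:

```r
ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp, colour = continent)
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") + # no colour set here: one line per continent
  scale_x_log10(labels = scales::dollar)
```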
However, this has become too cluttered, and it’s hard to understand what’s going on with all these colours and lines. We reached saturation and need to revert to a simpler plot.
Show the code
ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", colour = "orange", fill = "grey90") +
  facet_wrap(~continent) +
  scale_x_log10(labels = scales::dollar) +
  theme_minimal() +
  labs(
    x = "GDP per capita",
    y = "Life Expectancy in Years",
    title = "Economic Growth and Life Expectancy",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder."
  )
`geom_smooth()` using formula = 'y ~ x'
What you observe above is variously termed as a small multiple visualisation, or faceted visualisation. The key point is that we break down the data in meaningful groups, and analyse them separately. This enables us to see more clearly similarities and differences. Remember, all analyses are comparisons.
A radically different take is to exploit the time dimension.
Show the code
gapminder |>
  # subset the data to just the rows with these continents
  filter(continent %in% c("Asia", "Africa", "Europe", "Americas")) |>
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_line(color = "gray80", aes(group = country)) +
  geom_smooth(linewidth = 1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~continent, ncol = 2) +
  theme_minimal() +
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five Continents"
  )
`geom_smooth()` using formula = 'y ~ x'
A few takeaways from this.
First, our visualisation obviously brings the reader’s attention to the blue trend line by continent. For instance, the value at the end of the time series for the grouped trend line in Asia is roughly where the grouped time series starts in Europe.
Second, pay attention to the larger variation surrounding the trend line in continents such as Asia and Africa compared to the relatively tighter distribution in Europe. Inter alia, this would have substantial consequences were we to fit a model to such data: the larger variance in the former continents would mean larger uncertainty in our estimates compared to the European case. In another case that you may be familiar with, this could happen with parties’ polls, where substantial variation in the polls translates into larger margins of error in projections.
Show the code
rm(p)
6 UN Votes
In this section too, the data comes directly from a library, exactly as in the case of gapminder. How many tables are we dealing with?
Show the code
library(unvotes)
If you use data from the unvotes package, please cite the following:
Erik Voeten "Data and Analyses of Voting in the UN General Assembly" Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013)
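To answer the question above: the package ships three tables - un_votes, un_roll_calls, and un_roll_call_issues - and you can list them yourself:

```r
# list the datasets bundled with a package
data(package = "unvotes")
```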
Share of votes, by country. Check how you can do it in two ways, with a rough appreciation of the timing difference. I’ll walk you through just the tidyverse implementation.
we call the data
we group by a variable of interest
we squash all the data by the grouping variable, and calculate two new variables, namely the total number of votes and the number of “yes” votes.
we close the grouped calculations by calling off the group_by - that is, ungroup()
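The steps above can be sketched as follows, one dplyr verb per step:

```r
un_votes |>                      # 1. call the data
  group_by(country) |>           # 2. group by a variable of interest
  summarise(
    n_votes = n(),               # 3. total number of votes ...
    n_yes = sum(vote == "yes")   #    ... and number of "yes" votes
  ) |>
  ungroup()                      # 4. call off the grouped calculations
```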
As a more advanced topic, if you know you’ll execute that same function over and over, you can re-package that in a function. Notice a few things:
name/position of arguments
default values
calling functions explicitly from namespace
Show the code
# write your first function
summarise_votes <- function(data_in = un_votes, min_tot_votes = 10L) {
  data_in |>
    dplyr::summarise(
      n_votes = n(),
      n_yes = sum(vote == "yes", na.rm = TRUE)
    ) |>
    dplyr::filter(n_votes >= min_tot_votes) |> # filter by at least this many votes
    dplyr::mutate(pct_yes = n_yes / n_votes) |>
    dplyr::arrange(dplyr::desc(pct_yes)) # arrange by pct_yes
}

# test, we should get the same answer as above ... -----------------------------
# un_votes |> group_by(country) |> summarise_votes() |> ungroup()
Similar to the previous case - the gapminder data - we may be interested in the evolution over time. To get that piece of information, we need to introduce another topic, namely merges (or joins). Indeed, votes’ dates are recorded in the other dataset, un_roll_calls. Notice the use of left_join.
Show the code
un_votes <- un_votes |> # start with this data
  dplyr::left_join( # join only the matches in the origin data
    y = un_roll_calls |> # from this data, we keep only the matches in the incoming data from the pipe
      dplyr::select(rcid, date), # we also subset the right-hand side data on the fly
    by = "rcid" # we merge by this col
  ) |>
  dplyr::mutate(year = lubridate::year(date)) # create a year col

# same result, this time in base R
# merge(un_votes, un_roll_calls[, c("rcid", "date")],
#       by = "rcid",
#       all.x = TRUE) # left join (all.x only; adding all.y = TRUE would make it a full join)
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
This can become too cluttered very soon. How do we deal with that? Faceting is your friend. However, depending on the country, the plot may be similar or way off from the aggregate. So, we can remind ourselves in every graph what the overall trend is. Now, you can tweak your code to represent your overall trend in any way you like (e.g. regional data?).
Show the code
un_votes |>
  dplyr::filter(
    country %in% c(
      "France", "Germany", "Sweden", "United States",
      "Russia", "China"
    )
  ) |>
  dplyr::mutate(year = lubridate::year(date)) |>
  dplyr::group_by(country, year) |> # notice we're grouping by 2 cols here
  summarise_votes() |>
  ungroup() |>
  ggplot() + # notice this is empty
  geom_line(aes(
    x = year, y = pct_yes, # we're passing aes directly to geoms
    group = country, colour = country
  )) +
  # here we're providing new data, aggregated on the fly
  geom_line(
    data = un_votes |>
      dplyr::mutate(year = lubridate::year(date)) |>
      dplyr::group_by(year) |>
      summarise_votes() |>
      ungroup(),
    aes(x = year, y = pct_yes),
    colour = "grey", linewidth = 1
  ) +
  facet_wrap(~country, ncol = 2) +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(colour = "", x = "Year", y = "Share of yes (%)") +
  expand_limits(y = 0) +
  theme_minimal() +
  theme(legend.position = "top")
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
7 EP Data
7.1 How many documents in each Committee?
Say that you want to organise the Group so that the Committees that receive the most files are also allocated more manpower. How would you go about answering this question?
Find the data.
Load it in your machine.
Process it.
Spit back the result.
Show the code
#------------------------------------------------------------------------------#
# Download and Process EP Open Data on Plenary Docs ----------------------------
#------------------------------------------------------------------------------#
#' Collects .csv files from the [EP Open Data Portal](https://data.europarl.europa.eu/en/datasets?language=en&order=RELEVANCE&dataThemeFamily=dataset.theme.EP_PLEN_DOC).
#' The code then proceeds to tidy such data and write to disk.

# Texts tabled: read csv & append it all together ------------------------------
docs_csv <- c(
  "https://data.europarl.europa.eu/distribution/plenary-documents_2024_15_en.csv",
  "https://data.europarl.europa.eu/distribution/plenary-documents_2023_52_en.csv",
  "https://data.europarl.europa.eu/distribution/plenary-documents_2022_33_en.csv",
  "https://data.europarl.europa.eu/distribution/plenary-documents_2021_22_en.csv",
  "https://data.europarl.europa.eu/distribution/plenary-documents_2020_14_en.csv",
  "https://data.europarl.europa.eu/distribution/plenary-documents_2019_5_en.csv"
)

# read all .csv at once
docs_list <- lapply(X = docs_csv, readr::read_csv)

# append all .csv together -----------------------------------------------------
plenary_docs <- data.table::rbindlist(l = docs_list, use.names = TRUE, fill = TRUE)

#------------------------------------------------------------------------------#
# Same thing, as a loop --------------------------------------------------------
# docs_list <- NULL
# for (i_doc in docs_csv) {
#   print(i_doc)
#   docs_list[[i_doc]] <- read.csv(file = i_doc)
# }
#------------------------------------------------------------------------------#
rm(docs_list, docs_csv)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2661024 142.2 4140331 221.2 4140331 221.2
Vcells 8494063 64.9 25760601 196.6 25756967 196.6
Extract the year feature, similar to what we did above, so that we can group the data afterwards.
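A sketch of the extraction, assuming the downloaded table has a date column - hypothetically called work_date here; check names(plenary_docs) for the actual column name:

```r
# "work_date" is a hypothetical name: adjust to the real date column
plenary_docs$year <- lubridate::year(plenary_docs$work_date)
```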
[1] "Bureau of the European Parliament"
[2] "Committee of Inquiry on the Protection of Animals during Transport"
[3] "Committee on Agriculture and Rural Development"
[4] "Committee on Agriculture and Rural Development; Committee on the Environment, Public Health and Food Safety"
[5] "Committee on Budgetary Control"
[6] "Committee on Budgetary Control; Committee on Budgets"
[1] "Special committee on financial crimes, tax evasion and tax avoidance"
[2] "Special Committee on Foreign Interference in all Democratic Processes in the European Union, including Disinformation"
[3] "The Left group in the European Parliament - GUE/NGL"
[4] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; European Conservatives and Reformists Group; Renew Europe Group; Group of the European People's Party (Christian Democrats)"
[5] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; Renew Europe Group"
[6] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; Renew Europe Group; Group of the European People's Party (Christian Democrats)"
There are way too many values here. We need to filter on the string to get just the pattern of interest, namely Committee on.
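A sketch of such a filter with stringr; the column name document_creator_organization matches the output below:

```r
plenary_docs |>
  dplyr::filter(
    stringr::str_detect(document_creator_organization, "Committee on")
  ) |>
  dplyr::count(document_creator_organization, name = "count", sort = TRUE)
```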
# A tibble: 64 × 2
document_creator_organization count
<chr> <int>
1 Committee on Budgetary Control 381
2 Committee on Economic and Monetary Affairs 180
3 Committee on the Environment, Public Health and Food Safety 175
4 Committee on Civil Liberties, Justice and Home Affairs 154
5 Committee on Foreign Affairs 124
6 Committee on Legal Affairs 124
7 Committee on Budgets 103
8 Committee on International Trade 91
9 Committee on Transport and Tourism 79
10 Committee on Industry, Research and Energy 72
# ℹ 54 more rows
However, we’re not done here. If we actually check the tail of the table above - by simply running tail() - we can notice something weird.